TCGA Expedition: A Data Acquisition and Management System for TCGA Data

Authors

  • Uma R. Chandran
  • Olga P. Medvedeva
  • M. Michael Barmada
  • Philip D. Blood
  • Anish Chakka
  • Soumya Luthra
  • Antonio Ferreira
  • Kim F. Wong
  • Adrian V. Lee
  • Zhihui Zhang
  • Robert Budden
  • J. Ray Scott
  • Annerose Berndt
  • Jeremy M. Berg
  • Rebecca S. Jacobson
Abstract

BACKGROUND The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition support command line access at high-performance computing facilities as well as some functionality with third-party tools. For a subset of TCGA data collected at the University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices.

RESULTS TCGA Expedition software consists of a set of scripts written in Bash, Python and Java that download, extract, harmonize, version and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR), that enabled investigators at our institution to work with all TCGA data formats and to interrogate these data with analysis pipelines and associated tools. WGS data are especially challenging for individual investigators to use, due to issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable.

CONCLUSION Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.
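To make the idea of a versioned, participant- and sample-centered directory more concrete, the following is a minimal Python sketch, not the TCGA Expedition implementation. It relies on the standard TCGA barcode layout (for example `TCGA-02-0001-01C-01D-0182-01`, where the first three fields identify the participant and the fourth the sample and vial). The directory scheme, the `FileRecord` fields, the `v<N>` version folders, and the example URL are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata entry linking a local copy to its original source."""
    participant: str
    sample: str
    version: int
    local_path: str
    original_url: str


def parse_barcode(barcode: str) -> Tuple[str, str]:
    """Split a TCGA barcode into participant-level and sample-level identifiers."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    participant = "-".join(parts[:3])  # e.g. TCGA-02-0001
    sample = "-".join(parts[:4])       # e.g. TCGA-02-0001-01C
    return participant, sample


def make_record(root: str, barcode: str, filename: str,
                version: int, original_url: str) -> FileRecord:
    """Build the (assumed) local path root/participant/sample/v<version>/filename."""
    participant, sample = parse_barcode(barcode)
    local = PurePosixPath(root) / participant / sample / f"v{version}" / filename
    return FileRecord(participant, sample, version, str(local), original_url)
```

With this assumed layout, every file for one participant sits under a single folder, each new download of a file goes into a new version folder rather than overwriting the old one, and the record keeps a pointer back to the original source so provenance can be traced.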

Similar resources

TCGA-Assembler 2: Software Pipeline for Retrieval and Processing of TCGA/CPTAC Data.

Motivation The Cancer Genome Atlas (TCGA) program has produced huge amounts of cancer genomics data providing unprecedented opportunities for research. In 2014, we developed TCGA-Assembler (Zhu et al., 2014), a software pipeline for retrieval and processing of public TCGA data. In 2016, TCGA data were transferred from the TCGA data portal to the Genomic Data Commons (GDC), which is supported by...

Augmented expression levels of lncRNAs ecCEBPA and UCA1 in gastric cancer tissues and their clinical significance

Objective(s): As the second cause of cancer death, gastric cancer (GC) is one of the eminent dilemmas all over the world, therefore investigating the molecular mechanisms involved in this cancer is pivotal. Unrestricted proliferation is one of the characteristics of cancerous cells, which is due to deficiency in cell regulatory systems. Long non-coding RNAs (lncRNAs) have emerged as critical re...

RTCGAToolbox: A New Tool for Exporting TCGA Firehose Data

BACKGROUND & OBJECTIVE Managing data from large-scale projects (such as The Cancer Genome Atlas (TCGA)) for further analysis is an important and time consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent st...

TCGA-Assembler: Pipeline for TCGA Data Downloading, Assembling, and Processing (Supplementary Methods)

The Cancer Genome Atlas (TCGA) is supported by the National Cancer Institute and the National Human Genome Research Institute to chart the molecular landscape of tumor samples for more than 20 types of cancer [1-3]. TCGA has been generating multi-modal genomics, epigenomics, and proteomics data for thousands of cancer patients, providing unprecedented opportunities for researchers to systematic...

Extending TCGA queries to automatically identify analogous genomic data from dbGaP

Data sharing is critical to advance genomic research by reducing the demand to collect new data by reusing and combining existing data and by promoting reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have cr...

Journal title:

Volume 11, Issue

Pages -

Publication date: 2016